In this notebook we look at whether there is a statistically significant relationship between the average household size and the mean household income. The theory is that families at either extreme of the size spectrum will have a reduced income: very small households may reflect not being able to afford children, while very large families might result from poor access to birth control in poorer environments. Written by Alex Fontani.


In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model
import statsmodels.api as sm

%matplotlib inline
# Load data
city_data = pd.read_csv('USData_ClassProject1.csv')

In [28]:
# Response and predictor for this analysis
y = city_data.MeanHouseholdIncome
x = city_data.AverageHouseholdSize
plt.scatter(x, y)
plt.xlabel('Average Household Size')
plt.ylabel('Mean Household Income')


Out[28]:
<matplotlib.text.Text at 0x12fcf190>
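
One rough way to check the theory from the introduction (lower income at both extremes of household size) is to fit a quadratic to the scatter above and look at the sign of the squared term. This is only a sketch: the helper name quadratic_peak is made up for illustration, it assumes the two columns contain no missing values, and a negative squared term only suggests a peak, it does not show that the peak is statistically significant.

```python
import numpy as np

def quadratic_peak(sizes, incomes):
    """Fit income = a*size**2 + b*size + c and locate the peak, if any.

    Returns (a, peak), where peak is the household size at which the
    fitted curve is highest, or None when the curve opens upward.
    """
    sizes = np.asarray(sizes, dtype=float)
    incomes = np.asarray(incomes, dtype=float)
    # np.polyfit returns coefficients from the highest degree down
    a, b, c = np.polyfit(sizes, incomes, deg=2)
    # Only a downward-opening parabola (a < 0) has a maximum
    peak = -b / (2 * a) if a < 0 else None
    return a, peak
```

With the data loaded above this could be called as quadratic_peak(city_data.AverageHouseholdSize, city_data.MeanHouseholdIncome), using the raw household sizes rather than binned ones.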

The scatter plot is interesting: it looks like cities with an average household size of about 3 make a bit more. Let's plot the distribution of household size with a histogram:


In [29]:
city_data.AverageHouseholdSize.hist()


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x13543b30>

As you can see, this histogram shows that the great majority of cities have an average household size of around 2 or 3. Let's create a box plot and investigate the income for each household size more closely:


In [31]:
# Work on a copy so the raw data in city_data stays intact
df = city_data.copy()
# Bin household size to whole numbers: [0, 0.4) -> 0, [0.4, 1.4) -> 1, ..., 5.4 and above -> 6
bins = [0, 0.4, 1.4, 2.4, 3.4, 4.4, 5.4, np.inf]
# labels=False returns the bin index (0 to 6), which matches the intended group number
df['AverageHouseholdSize'] = pd.cut(df.AverageHouseholdSize, bins=bins, right=False, labels=False)
df.boxplot(column='MeanHouseholdIncome', by='AverageHouseholdSize', grid=False)

#df[['MeanHouseholdIncome','AverageHouseholdSize']]


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x12fe37b0>

As you can see, the typical income for cities with an average household size of 3 is slightly higher than for the rest, which is an interesting result. However, I still don't think that the difference is statistically significant.
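
The box plot alone cannot settle whether the groups really differ, so a test is needed. A minimal sketch of one option is a permutation test on the one-way ANOVA F statistic: shuffle the household-size labels many times and count how often the shuffled F is at least as large as the observed one. This is not part of the original analysis; the function names, the number of permutations and the seed are all assumptions chosen for illustration, and rows with missing values would need to be dropped first.

```python
import numpy as np

def f_statistic(values, groups):
    """One-way ANOVA F: between-group variance over within-group variance."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = values.mean()
    ss_between = 0.0
    ss_within = 0.0
    for label in labels:
        v = values[groups == label]
        ss_between += len(v) * (v.mean() - grand_mean) ** 2
        ss_within += ((v - v.mean()) ** 2).sum()
    df_between = len(labels) - 1
    df_within = len(values) - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)

def permutation_p_value(values, groups, n_perm=2000, seed=0):
    """Return (observed F, p-value) from shuffling the group labels."""
    rng = np.random.default_rng(seed)
    groups = np.asarray(groups)
    observed = f_statistic(values, groups)
    count = 0
    for _ in range(n_perm):
        # Shuffling breaks any real link between group and income
        shuffled = rng.permutation(groups)
        if f_statistic(values, shuffled) >= observed:
            count += 1
    # Add one to both counts so the estimated p-value is never exactly zero
    return observed, (count + 1) / (n_perm + 1)
```

With the binned frame above, this could be run as permutation_p_value(df.MeanHouseholdIncome, df.AverageHouseholdSize). A small p-value (say below 0.05) would suggest that income does differ between household sizes, while a large one would support the impression from the box plot.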